In today's lab we will go over fitting and interpreting a Poisson GLM, with special focus on how to check for and account for overdispersion. We will also walk through an example showing how the Newton-Raphson algorithm works.

# install pacman
if(!require(pacman)){ install.packages("pacman")}

# load necessary packages
library(pacman)
p_load(tidyverse, 
       ggplot2,
       janitor, #clean_names() function
       kableExtra, #printing nice tables
       skimr, #data summary
       gganimate, #turn ggplots into animations
       transformr,#turn ggplots into animations
       data.table, #read in table from URL
       AER, #overdispersion tests
       DHARMa) #overdispersion tests

Coding Example (Poisson GLMs/Overdispersion)

The Data

These data contain information about horseshoe crab breeding. A female horseshoe crab migrates to the shore to breed with a male crab attached to her spine, then burrows into the sand, where she lays her eggs. The eggs are fertilized externally by the male. Other males may cluster around the pair and may also fertilize eggs; these males are called satellites. Our outcome of interest is the number of satellites per female crab. We have the following data about each female crab:

  • color:
    • 1 = medium light
    • 2 = medium
    • 3 = medium dark
    • 4 = dark
  • spine condition:
    • 1 = both good
    • 2 = one worn or broken
    • 3 = both worn or broken
  • Carapace width in cm
  • Weight in kg
crab_df = fread("http://users.stat.ufl.edu/~aa/glm/data/Crabs.dat")

skim(crab_df)
Data summary
Name crab_df
Number of rows 173
Number of columns 6
Key NULL
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
crab 0 1 87.00 50.08 1.0 44.0 87.00 130.00 173.0 ▇▇▇▇▇
y 0 1 2.92 3.15 0.0 0.0 2.00 5.00 15.0 ▇▃▁▁▁
weight 0 1 2.44 0.58 1.2 2.0 2.35 2.85 5.2 ▅▇▃▁▁
width 0 1 26.30 2.11 21.0 24.9 26.10 27.70 33.5 ▂▇▇▂▁
color 0 1 2.44 0.80 1.0 2.0 2.00 3.00 4.0 ▁▇▁▃▂
spine 0 1 2.49 0.83 1.0 2.0 3.00 3.00 3.0 ▂▁▁▁▇

Plotting the data

First, we plot a histogram of the outcome variable.

# histogram of y
crab_df%>%
  ggplot(aes(x = y))+
  geom_histogram(binwidth = 1, color = "black")+
  labs(x = "# satellites")+
  theme_bw()

Suppose we just want to look at the relationship between weight and number of satellites. Below is a plot of that relationship.

crab_df%>%
  ggplot(aes(x = weight, y = y))+
  geom_point(size = 0.5)+
  theme_bw()

Because this is count data with relatively rare events (not many satellites per female), we may be inclined to model this using a Poisson GLM. However, we should check a few assumptions first. Namely, a Poisson model assumes that the variance is equal to the mean at any given level of the covariates. Let's check this.

Checking for Overdispersion

Below I check the mean and variance within subgroups of our covariate x (weight). Because weight is a continuous variable, I group it into intervals and calculate the mean and variance within each interval to get an idea of whether the mean is about equal to the variance at each level of x. Notice that the variance tends to be much larger than the mean (up to four times larger in some subgroups).

# round down to nearest x.5 weight and calculate mean and variance of outcome
crab_df%>%
  mutate(weight = plyr::round_any(weight, .5, floor))%>%
  group_by(weight)%>%
  summarise(n = n(),
            mean = mean(y),
            var = var(y),
            ratio = var/mean)
## # A tibble: 7 × 5
##   weight     n  mean    var  ratio
##    <dbl> <int> <dbl>  <dbl>  <dbl>
## 1    1       5  0.8   3.2    4    
## 2    1.5    32  1.38  5.66   4.12 
## 3    2      59  2.58  9.46   3.67 
## 4    2.5    42  3.14  6.47   2.06 
## 5    3      30  5    15.2    3.05 
## 6    3.5     4  4     0.667  0.167
## 7    5       1  7    NA     NA

If we run a simple Poisson GLM we might get a reasonable estimate for the mean, but the variance estimates (and hence, our statistical inference) will be too optimistic, with standard errors underestimated. Below we plot the fitted estimate for the Poisson model.

fit = glm( data = crab_df, y~ weight, family = "poisson")

int = fit$coefficients[1]%>%unname()

slope = fit$coefficients[2]%>%unname()


crab_df%>%
  ggplot(aes(x = weight, y = y))+
  geom_point(size = 0.5)+
  theme_bw()+
  stat_function(fun = function(weight) exp(int + slope*weight))

To show how far from a Poisson distribution we are, let's simulate what our data might have looked like if they came from a Poisson model without overdispersion. Namely, we fit a Poisson regression for the mean with weight as a predictor and get estimated coefficients (intercept \(\hat{\beta}_0\) and slope \(\hat{\beta}_1\)). Then, for each observation \(i\) in our dataset, we randomly sample from a Poisson distribution whose mean is our fitted value given weight \(w_i\). That is, we generate for observation \(i\) an outcome \(y_i \sim Pois\big(exp(\hat{\beta}_0+\hat{\beta}_1 w_i)\big)\).

We then plot the simulated data next to our actual data. This gives us a sense of what we would expect if our data were truly Poisson distributed with no overdispersion.

Simulated Poisson vs Real Data

set.seed(123)
sim = data.frame(weight = crab_df$weight)
sim$lambda = exp(int+slope*sim$weight)
sim$y = rep(NA, nrow(sim))
for(i in 1:nrow(sim)){
  sim$y[i] = rpois(n = 1, lambda = sim$lambda[i])
}

sim%>%
  dplyr::select(-lambda)%>%
  mutate(data_source = rep("Simulated", nrow(sim)))%>%
  bind_rows(crab_df%>%
              dplyr::select(weight,y)%>%
              mutate(data_source = rep("Real data", nrow(crab_df)) ))%>%
  ggplot(aes(x = weight, y = y))+
  geom_point(size = 0.5)+
  theme_bw()+
  stat_function(fun = function(weight) exp(int + slope*weight))+
  facet_wrap(.~data_source)

sim%>%
  mutate(weight = plyr::round_any(weight, .5, floor))%>%
  group_by(weight)%>%
  summarise(n = n(),
            mean = mean(y),
            var = var(y),
            ratio = var/mean)
## # A tibble: 7 × 5
##   weight     n  mean   var  ratio
##    <dbl> <int> <dbl> <dbl>  <dbl>
## 1    1       5  1.8   2.2   1.22 
## 2    1.5    32  1.78  1.40  0.787
## 3    2      59  2.34  2.09  0.894
## 4    2.5    42  3.57  3.18  0.890
## 5    3      30  4.03  4.24  1.05 
## 6    3.5     4  5     4.67  0.933
## 7    5       1 12    NA    NA

Note that our data are much more dispersed than the simulated data.

Other packages to check for overdispersion

More formally, we can use packages such as the AER or DHARMa packages to check for overdispersion. These tests can be quite helpful when you are fitting a more complicated model with a larger set of predictors, in which case you wouldn't want to have to check for overdispersion at all levels of all combinations of your covariates (plus higher order terms, interaction terms, etc. would make it even more complicated).

# using AER functions
dispersiontest(fit)
## 
##  Overdispersion test
## 
## data:  fit
## z = 5.3758, p-value = 3.812e-08
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   3.116407
# using DHARMa functions
res = simulateResiduals(fit, refit=T) 
testDispersion(res)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  res
## dispersion = 3.1427, p-value < 2.2e-16
## alternative hypothesis: two.sided
plot(res)

To see what these plots should look like when there is no overdispersion, consider our simulated dataset.

fit_sim = glm(y~weight, data = sim, family = "poisson")

# using AER functions
dispersiontest(fit_sim)
## 
##  Overdispersion test
## 
## data:  fit_sim
## z = -1.4384, p-value = 0.9248
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8773255
# using DHARMa functions
res_sim = simulateResiduals(fit_sim, refit=T) 
testDispersion(res_sim)

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  res_sim
## dispersion = 0.8862, p-value = 0.328
## alternative hypothesis: two.sided
plot(res_sim)

Clearly there is overdispersion... So what do we do?

Accounting for Overdispersion

One option is to fit a quasipoisson model, which allows for overdispersion. We can also account for overdispersion manually by inflating standard errors by \(\sqrt{\hat{\phi}}\), where \(\hat{\phi}\) is our estimated dispersion parameter. We could also fit a negative binomial model instead, which has an additional parameter that lets the variance exceed the mean (unlike the Poisson, whose variance equals its mean). The function for fitting a negative binomial model in R is glm.nb() from the MASS package.
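As a rough sketch of that last option, here is glm.nb() on simulated overdispersed counts (illustrative data, not the crab dataset):

```r
# A minimal negative binomial fit with MASS::glm.nb(), on simulated
# overdispersed counts (illustrative data, not the crab dataset)
library(MASS)

set.seed(1)
toy <- data.frame(x = runif(200, 1, 5))
# rnbinom with size = 2 gives Var(Y) = mu + mu^2/2 > mu (overdispersed)
toy$y <- rnbinom(200, size = 2, mu = exp(0.2 + 0.5 * toy$x))

fit_nb <- glm.nb(y ~ x, data = toy)
fit_nb$theta    # estimated dispersion parameter (the "size")
coef(fit_nb)    # intercept and slope on the log scale
```

The estimated slope should land near the true value of 0.5, and theta near 2.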

fit_weight = glm( data = crab_df, y~ weight, family = "poisson")

summary(fit_weight)
## 
## Call:
## glm(formula = y ~ weight, family = "poisson", data = crab_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9307  -1.9981  -0.5627   0.9298   4.9992  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.42841    0.17893  -2.394   0.0167 *  
## weight       0.58930    0.06502   9.064   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 560.87  on 171  degrees of freedom
## AIC: 920.16
## 
## Number of Fisher Scoring iterations: 5
# estimate dispersion parameter with deviance/(n-p)
disp = fit_weight$deviance/(173 - 2)
disp
## [1] 3.279921

If we fit the simple Poisson model and estimate the dispersion using the formula \[\phi \approx \frac{D}{n-p}, \] where \(D\) is the residual deviance, \(n\) is the number of observations, and \(p\) is the number of parameters in our model, we get \(\hat{\phi} \approx 3.28\). We can manually adjust our standard errors by multiplying them by the square root of this estimated dispersion parameter.

# manually adjust standard errors 
0.17893*sqrt(disp)
## [1] 0.3240521
0.06502*sqrt(disp)
## [1] 0.1177548

We can also adjust for overdispersion by fitting a quasipoisson model, which allows the variance to exceed the mean by an unknown multiplicative factor. That is, it models the variance as the variance function times a dispersion parameter, \(Var(Y)= \phi\, v(\mu)\), where \(v(\mu)=\mu\) for count data.

# adjust standard errors using quasipoisson method
fit_weight_q = glm( data = crab_df, y~ weight, family = "quasipoisson")
summary(fit_weight_q)
## 
## Call:
## glm(formula = y ~ weight, family = "quasipoisson", data = crab_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9307  -1.9981  -0.5627   0.9298   4.9992  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4284     0.3168  -1.352    0.178    
## weight        0.5893     0.1151   5.120 8.17e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.13414)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 560.87  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

We can see that the two methods give us very similar estimates for our standard errors, and hence confidence intervals.

# adjusting for overdispersion using dispersion estimate
paste("(",round(exp(0.5893-1.96*sqrt(disp)*0.06502), digits = 2),", ",round(exp(0.5893+1.96*sqrt(disp)*0.06502), digits = 2), ")", sep = "")
## [1] "(1.43, 2.27)"
# adjusting for overdispersion using quasipoisson
paste("(",round(exp(0.5893-1.96*.1151), digits = 2),", ",round(exp(0.5893+1.96*.1151), digits = 2), ")", sep = "")
## [1] "(1.44, 2.26)"
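Aside: the quasipoisson summary reports a dispersion of about 3.13 rather than our deviance-based 3.28 because glm() estimates \(\phi\) from the Pearson statistic, \(\hat{\phi} = \frac{1}{n-p}\sum_i \frac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i}\). A minimal check on simulated data (the variable names here are illustrative):

```r
# Pearson-based dispersion estimate matches the quasipoisson summary
# (simulated Poisson data; not the crab dataset)
set.seed(2)
dat <- data.frame(x = runif(150))
dat$y <- rpois(150, lambda = exp(1 + dat$x))

f_pois  <- glm(y ~ x, data = dat, family = "poisson")
f_quasi <- glm(y ~ x, data = dat, family = "quasipoisson")

phi_pearson <- sum(residuals(f_pois, type = "pearson")^2) / df.residual(f_pois)
phi_pearson                  # near 1 here, since the data are truly Poisson
summary(f_quasi)$dispersion  # identical to the Pearson-based estimate
```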

Fitting a Model

Obviously we would like to incorporate more predictors in the model, so now we fit a model with all covariates included.

fit = glm(y ~ weight + factor(color) + width + factor(spine) , data = crab_df, family = "quasipoisson")

summary(fit)
## 
## Call:
## glm(formula = y ~ weight + factor(color) + width + factor(spine), 
##     family = "quasipoisson", data = crab_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0290  -1.8630  -0.5988   0.9331   4.9446  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -0.36180    1.73852  -0.208   0.8354  
## weight          0.49647    0.29905   1.660   0.0988 .
## factor(color)2 -0.26485    0.30238  -0.876   0.3824  
## factor(color)3 -0.51371    0.35139  -1.462   0.1457  
## factor(color)4 -0.53086    0.40815  -1.301   0.1952  
## width           0.01675    0.08799   0.190   0.8493  
## factor(spine)2 -0.15037    0.38415  -0.391   0.6960  
## factor(spine)3  0.08728    0.21571   0.405   0.6863  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.235255)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 549.59  on 165  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Checking Collinearity

# width vs weight
crab_df%>%
  ggplot(aes(x = weight, y = width))+
  geom_point()+
  theme_bw()

# highly correlated
cor(crab_df$weight, crab_df$width)
## [1] 0.8868715
# weight vs spine
crab_df%>%
  ggplot(aes(x = weight, y = spine))+
  geom_point(position = "jitter")+
  theme_bw()

# spine vs color
crab_df%>%
  ggplot(aes(x = color, y = spine))+
  geom_point(position = "jitter")+
  theme_bw()

Based on these plots, we probably want to drop width from our model, since it is highly correlated with weight.

fit2 = update(fit, . ~ . -width)

summary(fit2)
## 
## Call:
## glm(formula = y ~ weight + factor(color) + factor(spine), family = "quasipoisson", 
##     data = crab_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0433  -1.8609  -0.5944   0.9193   4.9483  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.04263    0.45452  -0.094    0.925    
## weight          0.54756    0.13119   4.174 4.83e-05 ***
## factor(color)2 -0.26768    0.30083  -0.890    0.375    
## factor(color)3 -0.52087    0.34803  -1.497    0.136    
## factor(color)4 -0.53966    0.40380  -1.336    0.183    
## factor(spine)2 -0.16066    0.37909  -0.424    0.672    
## factor(spine)3  0.09086    0.21419   0.424    0.672    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.213748)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 549.70  on 166  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
fit3 = update(fit2, . ~ . -factor(spine))


broom::tidy(fit3)%>%
  kable(align = "c")%>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped")
term estimate std.error statistic p.value
(Intercept) -0.0497806 0.4160024 -0.1196642 0.9048921
weight 0.5461798 0.1215331 4.4940832 0.0000130
factor(color)2 -0.2051125 0.2742671 -0.7478565 0.4555923
factor(color)3 -0.4497964 0.3135639 -1.4344649 0.1532985
factor(color)4 -0.4520498 0.3719136 -1.2154698 0.2258923
round(exp(0.5461798), digits = 2)
## [1] 1.73
paste("(",round(exp(0.5461798-0.1215331*1.96), digits = 2),", ",round(exp(0.5461798+0.1215331*1.96), digits = 2),")", sep = "" )
## [1] "(1.36, 2.19)"

Interpretation: for every one kg increase in weight, the expected number of male satellites increases by a factor of 1.73 (95% CI: 1.36, 2.19).

Final thoughts: If you want to fit a Poisson model, always check overdispersion before making inference from your model. If there is evidence of overdispersion in the model, make sure to take that into account when reporting standard errors, confidence intervals, etc. You can also choose to fit another model (quasi-likelihood models, negative binomial, etc.) that allows for variance to be modeled separately from the mean.

Newton-Raphson/Fisher Scoring Demonstration

Below I give a visual example of how the Newton-Raphson algorithm works. First, I write a function to perform the algorithm (you did this on homework 1) and show that it gives the correct output by comparing to a GLM fit. I demonstrate the one-dimensional case for ease of visualization. The data come from HW1; I ignore the intercept so that we estimate only one beta value.

The data

# the data (only choosing one covariate so I can show the method in 1D)
df = read.table("Ex0107.txt",header = T)  
design = df%>%
  dplyr::select(3)%>%
  data.frame()

response = df$y_response

The log likelihood

The point of Newton-Raphson/Fisher Scoring is to find the MLE in cases where we cannot get a closed-form solution by directly maximizing the likelihood function. Below I plot the log likelihood as a function of beta.

Recall that for binary data with \(P(y_i = 1)= \pi_i\), the likelihood is

\[\prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i} \] and hence the log likelihood can be written as \[\sum_{i=1}^n y_i log(\pi_i)+(1-y_i)log(1-\pi_i) = \sum_{i=1}^n y_i log\Big(\frac{\pi_i}{1-\pi_i}\Big)+log(1-\pi_i) \]

Recall that \(\pi_i\) will be modeled as \(\pi_i = expit(\sum_j x_{ij}\beta_j) = \frac{exp(X_i^T\beta)}{1+exp(X_i^T\beta)}\) so this log likelihood is a function of \(\beta\), i.e. \[L(\beta) = \sum_{i=1}^n y_i log(\pi_i)+(1-y_i)log(1-\pi_i) \] depends on \(\beta\) in that \(\pi_i\) is a function of \(\beta\).

# function for log likelihood
loglik = function(beta, y = response, X = design){
  X = as.matrix(unname(X))
  y = as.vector(unname(y))
  beta = as.matrix(unname(beta))
  
  bx = X%*%beta
  sum(
    y*log(1/(1+exp(-bx)))+(1-y)*log(1-(1/(1+exp(-bx))))
  )
}


# estimate log likelihood at different values of beta
d = data.frame(x = seq(-1,3,by = 0.01))

loglik.vals = c()
for(i in 1:nrow(d)){
 l = loglik(y = response, beta = matrix(d$x[i]), X = design)
 loglik.vals = c(loglik.vals,l)
}
d$l = loglik.vals
  
# plot log likelihood
ggplot(data = d, aes(x = x, y = l))+
  geom_line()+
  labs(x = "beta", y = "log likelihood")+
  theme_bw()

Newton-Raphson/Fisher Scoring exploit the fact that the log likelihood of an exponential family distribution (when the natural parameter is used) is concave. The process is as follows:

    1. Start with an initial estimate of \(\beta\). (Here I have chosen the initial value to be 0).
    2. Approximate the log likelihood of \(\beta\) using the second-order Taylor approximation at your current estimate. That is, if \(\beta^{(t)}\) is our current estimate of \(\beta\), we use the approximation:

\[L(\beta)\approx L(\beta^{(t)})+(\beta-\beta^{(t)})L'(\beta^{(t)})+\frac{1}{2}(\beta-\beta^{(t)})^2L''(\beta^{(t)}) \]

    3. Maximize this approximation of the log likelihood (which we can do because it is a concave degree-2 polynomial) to obtain a new estimate of \(\beta\), called \(\beta^{(t+1)}\).

If we take the derivative with respect to \(\beta\) and set the equation equal to zero we get

\[L'(\beta^{(t)}) + (\beta-\beta^{(t)})L''(\beta^{(t)})=0 \] \[\implies\beta^{(t+1)}= \beta^{(t)}-L'(\beta^{(t)})\big[L''(\beta^{(t)})\big]^{-1} \]

    4. Repeat this process until you have met some convergence threshold (i.e. \(|\beta^{(t)}-\beta^{(t+1)}|<\delta\) for some threshold \(\delta\)).

Note: for the higher dimension case we would replace these terms with their vector form. Here, \((\beta - \beta^{(t)})^2L''(\beta^{(t)})\) becomes \((\beta - \beta^{(t)})^TH(\beta - \beta^{(t)})\) where \(H\) is the Hessian matrix (matrix of second-order partial derivatives).
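The update rule can be tried on a tiny hand-computable problem. As a sketch (this toy example is mine, not from the lab data): for \(y_i \sim Pois(\lambda)\) we have \(L'(\lambda) = \sum_i y_i/\lambda - n\) and \(L''(\lambda) = -\sum_i y_i/\lambda^2\), and Newton-Raphson should converge to the closed-form MLE \(\bar{y}\).

```r
# Toy Newton-Raphson: MLE of a Poisson rate (illustrative example,
# not the lab data). The closed-form answer is the sample mean.
y <- c(2, 4, 3, 5, 1)
lambda <- 1                                  # initial guess
for (t in 1:50) {
  score <- sum(y)/lambda - length(y)         # L'(lambda)
  hess  <- -sum(y)/lambda^2                  # L''(lambda)
  lambda_new <- lambda - score/hess          # NR update
  if (abs(lambda_new - lambda) < 1e-10) break
  lambda <- lambda_new
}
lambda   # converges to mean(y) = 3
```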

Recall: \[L'(\beta) = \sum_{i=1}^n \frac{\partial L}{\partial \pi_i}\frac{\partial \pi_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial \beta} \] \[\frac{\partial L}{\partial \pi_i}= \frac{y_i}{\pi_i}-\frac{1-y_i}{1-\pi_i}=\frac{y_i-\pi_i}{\pi_i(1-\pi_i)}\] For \(\pi_i = \frac{exp(\eta_i)}{1+exp(\eta_i)}\) \[\frac{\partial \pi_i}{\partial \eta_i} = \frac{e^{\eta_i}(1+e^{\eta_i})-e^{\eta_i}(e^{\eta_i})}{(1+e^{\eta_i})^2} = \frac{e^{\eta_i}}{(1+e^{\eta_i})^2} = \frac{e^{\eta_i}}{1+e^{\eta_i}}\bigg(1-\frac{e^{\eta_i}}{1+e^{\eta_i}}\bigg) = \pi_i(1-\pi_i)\] And finally, \[\frac{\partial \eta_i}{\partial \beta}= x_i \] So we get \[L'(\beta) = \sum_{i=1}^n \frac{y_i-\pi_i}{\pi_i(1-\pi_i)}\pi_i(1-\pi_i)x_i=\sum_{i=1}^n(y_i-\pi_i)x_i \] \[L''(\beta) = -\sum_{i=1}^n \pi_i(1-\pi_i)x_i^2\]
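As a quick sanity check on these derivations, the analytic score \(\sum_i (y_i-\pi_i)x_i\) should match a finite-difference derivative of the log likelihood. A sketch on simulated binary data (illustrative, not the HW1 data):

```r
# Check the analytic score against a numerical derivative of the log
# likelihood (simulated binary data; illustrative only)
set.seed(42)
x <- rnorm(20)
y <- rbinom(20, 1, plogis(1.2 * x))

ll <- function(b) sum(y * log(plogis(b * x)) + (1 - y) * log(1 - plogis(b * x)))

b0 <- 0.5
score_analytic <- sum((y - plogis(b0 * x)) * x)
score_numeric  <- (ll(b0 + 1e-6) - ll(b0 - 1e-6)) / (2e-6)
c(score_analytic, score_numeric)   # these should agree to several decimals
```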

The algorithm

Using the derivations above, we can implement the algorithm using our data.

###  Newton-Raphson function
NR = function(X, y){
  X = as.matrix(X)
  y = as.vector(y)
  
  output = list()
  
  ### set initial beta value
  beta = matrix(0, nrow = ncol(X), ncol = 1)
  estimates = c(beta)
  
  ### define a step variable for the convergence check
  delta = 1
  
  while(sum(abs(delta)) > 1e-10){
    eta = as.vector(X%*%beta)
    pi = as.vector(exp(eta)/(1+exp(eta)))
    W = diag(pi*(1-pi))
    var = solve(t(X)%*%W%*%X)       # inverse information
    delta = var%*%t(X)%*%(y-pi)     # NR step: -[L'']^{-1} L'
    
    ### update beta
    beta = beta + delta
    estimates = c(estimates, as.vector(beta))
  }
  estimates = c(estimates, as.vector(beta))
  
  output$mle = beta
  output$var = as.matrix(var)
  output$estimate_history = matrix(estimates, ncol = ncol(X), byrow = T)
  return(output)
}
# get Newton-Raphson output from our function
nr = NR(design,response)
nr
## $mle
##                   [,1]
## x2_covariate2 1.258173
## 
## $var
##               x2_covariate2
## x2_covariate2     0.4850241
## 
## $estimate_history
##           [,1]
## [1,] 0.0000000
## [2,] 0.9225283
## [3,] 1.2064655
## [4,] 1.2568036
## [5,] 1.2581719
## [6,] 1.2581728
## [7,] 1.2581728
## [8,] 1.2581728
# check against glm
fit = glm(y_response~x2_covariate2-1, family = "binomial", data = df)
logLik(fit)
## 'log Lik.' -11.32955 (df=1)
loglik(beta = nr$mle)
## [1] -11.32955
# plot the model fit at each iteration (i.e. each value of beta that we try until convergence)
df%>%
  ggplot(aes(x = x2_covariate2, y = y_response))+
  geom_point()+
  stat_function(fun = function(x) exp(x*nr$mle)/(1+exp(x*nr$mle)), size = 2)+
  stat_function(fun = function(x) exp(x*nr$estimate_history[1])/(1+exp(x*nr$estimate_history[1])), color = "red")+
  stat_function(fun = function(x) exp(x*nr$estimate_history[2])/(1+exp(x*nr$estimate_history[2])), color = "orange")+
  stat_function(fun = function(x) exp(x*nr$estimate_history[3])/(1+exp(x*nr$estimate_history[3])), color = "green")+
  stat_function(fun = function(x) exp(x*nr$estimate_history[4])/(1+exp(x*nr$estimate_history[4])), color = "blue")+
  theme_bw()

Visualizations

# estimate log likelihood at estimate of beta for each iteration in NR
approx = data.frame(beta_new = NULL, l = NULL, iteration = NULL, beta_est = NULL)
for(i in 1:8){
for(b in seq(-1,3,by = .01)){
  # current estimate of beta at iteration i
  b_old = nr$estimate_history[i]
  # log likelihood
  Lb = loglik(beta = b_old)
  # pi = expit(eta) = expit(Xbeta)
  pi = exp(design*b_old)/(1+exp(design*b_old))
  # first derivative/gradient
  u = sum((response-pi)*design )
  # second derivative/hessian
  H = -sum(pi*(1-pi)*design^2)

  # taylor approximation
  ll = Lb + u*(b-b_old) + (1/2)*(b-b_old)^2*H

  # update estimates
approx = rbind(approx, data.frame(beta_new = b,l = as.numeric(ll), iteration = i, beta_est = b_old))
}
}

approx = approx%>%
  mutate(next_est = ifelse(iteration<8, nr$estimate_history[iteration+1],nr$estimate_history[iteration]))

# specify labels
colors = c("Old Estimate" = "black", "New Estimate"= "red", "Taylor Approximation" = "blue")
sizes = c("Log likelihood"=.75)

ggplot(approx)+
  geom_line(data = d, aes(x = x, y = l, size = "Log likelihood"))+
  geom_line(data = approx, aes(x = beta_new, y = l, color = "Taylor Approximation"), linetype = "dashed")+
  labs(x = "beta", y = "log likelihood")+
  ylim(-23,-10)+
  theme_bw()+
  facet_wrap(.~iteration, labeller = "label_both")+
  geom_vline(aes(xintercept = beta_est, color = "Old Estimate"),linetype = "dotted")+
  geom_vline(aes(xintercept = next_est, color = "New Estimate"),linetype = "dotted")+
  scale_color_manual(values = colors, name = NULL)+
  scale_size_manual(values = sizes, name = NULL)+
  theme(legend.position = c(0.85,.1),legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black"),legend.spacing.y = unit(-0.17, "cm"), legend.text = element_text(size = 5),legend.key.size = unit(.5, "cm"))

p = ggplot(approx)+
  geom_line(data = d, aes(x = x, y = l, size = "Log likelihood"))+
  geom_line(data = approx, aes(x = beta_new, y = l, color = "Taylor Approximation"), linetype = "dashed")+
  transition_states(
  iteration,
    transition_length = 1,
    state_length =2
  )+
  labs(x = "beta", y = "log likelihood", title = "Iteration = {closest_state}")+
  ylim(-23,-10)+
  theme_bw()+
  geom_vline(aes(xintercept = beta_est, color = "Old Estimate"),linetype = "dotted")+
  geom_vline(aes(xintercept = next_est, color = "New Estimate"),linetype = "dotted")+
  scale_color_manual(values = colors, name = NULL)+
  scale_size_manual(values = sizes, name = NULL)

animate(p)